HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon’s Mechanical Turk. You can read more about it at https://arxiv.org/abs/1801.07746
In this R notebook, we process the raw textual data for our data analysis.
From the packages’ descriptions: tm is a framework for text mining applications within R; tidyverse is an opinionated collection of R packages designed for data science, all sharing an underlying design philosophy, grammar, and data structures; tidytext allows text mining using dplyr, ggplot2, and other tidy tools; DT provides an R interface to the JavaScript library DataTables.
library(tm)
library(tidytext)
library(tidyverse)
library(DT)
urlfile <- 'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv'
hm_data <- read_csv(urlfile)
We clean the text by converting all letters to lowercase and removing punctuation, numbers, and extra white space. The removeWords step is a placeholder for removing a custom word list; here the list is empty.
corpus <- VCorpus(VectorSource(hm_data$cleaned_hm)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, character(0)) %>%
  tm_map(stripWhitespace)
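To make the pipeline concrete, here is what those transformations do to a single made-up sentence (the sample text is ours for illustration, not from HappyDB):

```r
library(tm)

# a made-up sentence to show each cleaning step
sample <- VCorpus(VectorSource("I ate 2 DONUTS  this morning!!"))
sample <- tm_map(sample, content_transformer(tolower))  # lowercase
sample <- tm_map(sample, removePunctuation)             # drop "!!"
sample <- tm_map(sample, removeNumbers)                 # drop "2"
sample <- tm_map(sample, stripWhitespace)               # collapse double spaces

content(sample[[1]])  # "i ate donuts this morning"
```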
Stemming reduces a word to its word stem. We stem the words here and then convert the “tm” object to a “tidy” object for much faster processing.
stemmed <- tm_map(corpus, stemDocument) %>%  # stem each document using Porter's stemming algorithm
  tidy() %>%  # convert the "tm" corpus to a tidy data frame
  select(text)
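For a sense of what the stemmer produces, stemDocument can also be applied directly to a character vector of words:

```r
library(tm)

# Porter stemming on individual words
stemDocument(c("happiness", "enjoyed", "playing", "played"))
# "happi" "enjoy" "play" "play"
```

Note that stems such as "happi" are not dictionary words, which is why we build a lookup dictionary next.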
We also need a dictionary to look up the words corresponding to the stems.
dict <- tidy(corpus) %>%
  select(text) %>%
  unnest_tokens(dictionary, text)
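unnest_tokens splits each document into one token per row (lowercased by default). A minimal sketch on a toy sentence:

```r
library(dplyr)
library(tidytext)

# one toy document, tokenized into a column named "dictionary"
tibble(text = "I ate donuts") %>%
  unnest_tokens(dictionary, text)
# three rows: "i", "ate", "donuts"
```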
We remove stopwords provided by the “tidytext” package and also add custom stopwords relevant to our data: time-related words such as “yesterday” and “month”, and variants of “happy”, which appear in nearly every happy moment and carry little information.
data("stop_words")

word <- c("happy", "ago", "yesterday", "lot", "today", "months", "month",
          "happier", "happiest", "last", "week", "past")

stop_words <- stop_words %>%
  bind_rows(mutate(tibble(word), lexicon = "updated"))  # tibble() builds a one-column data frame of our custom words
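A small sketch of how the augmented stopword list behaves with anti_join (toy tokens of our own, not HappyDB data):

```r
library(dplyr)
library(tidytext)

data("stop_words")
# two custom stopwords, tagged with a lexicon label like the built-in rows
custom <- tibble(word = c("happy", "yesterday"), lexicon = "updated")
stop_words_demo <- bind_rows(stop_words, custom)

# anti_join keeps only tokens NOT found in the stopword list
tibble(word = c("happy", "dog", "the")) %>%
  anti_join(stop_words_demo, by = "word")
# only "dog" survives: "the" is a standard stopword, "happy" a custom one
```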
Here we combine the stems and the dictionary into the same “tidy” object.
completed <- stemmed %>%
  mutate(id = row_number()) %>%
  unnest_tokens(stems, text) %>%
  bind_cols(dict) %>%
  anti_join(stop_words, by = c("dictionary" = "word"))  # stop_words stores its words in a column named "word"; ours is named "dictionary"
Next, we complete the stems: for each stem, we pick the original word that occurs most frequently among the words mapping to it.
completed <- completed %>%
  group_by(stems) %>%
  count(dictionary) %>%  # count how often each original word appears for a given stem
  mutate(word = dictionary[which.max(n)]) %>%  # keep the most frequent word as the stem's completion
  ungroup() %>%
  select(stems, word) %>%
  distinct() %>%
  right_join(completed, by = "stems") %>%  # map every token back to its completed word
  select(-stems)
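The group_by / count / which.max idiom at the heart of this step can be seen on a toy example (hypothetical stems of our own making):

```r
library(dplyr)

# toy tokens: the stem "play" was produced by three original words
toy <- tibble(stems = c("play", "play", "play"),
              dictionary = c("playing", "played", "playing"))

toy %>%
  group_by(stems) %>%
  count(dictionary) %>%
  mutate(word = dictionary[which.max(n)]) %>%  # most frequent word wins
  ungroup()
# every row of stem "play" gets word = "playing" (2 occurrences beats 1)
```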
We want the processed words to resemble the structure of the original happy moments, so we paste the words of each moment back together into a single string.
completed <- completed %>%
  group_by(id) %>%
  summarise(text = str_c(word, collapse = " ")) %>%
  ungroup()
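The summarise / str_c step collapses one-token-per-row data back into one row per document. A minimal sketch with invented tokens:

```r
library(dplyr)
library(stringr)

# two toy documents, already tokenized
tokens <- tibble(id = c(1, 1, 2), word = c("ate", "donuts", "ran"))

tokens %>%
  group_by(id) %>%
  summarise(text = str_c(word, collapse = " "))  # paste each group's words with spaces
# id 1 -> "ate donuts", id 2 -> "ran"
```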
hm_data <- hm_data %>%
  mutate(id = row_number()) %>%
  inner_join(completed, by = "id")

datatable(hm_data)